ABSTRACT

Detection of copy number variation (CNV) in DNA 
has recently become an important method for 
understanding the pathogenesis of cancer. While 
existing algorithms for extracting CNV from micro- 
array data have worked reasonably well, the trend 
towards ever larger sample sizes and higher reso- 
lution microarrays has vastly increased the chal- 
lenges they face. Here, we present Segmentation 
analysis of DNA (SAD), a clustering algorithm con- 
structed with a strategy in which all operational 
decisions are based on simple and rigorous appli- 
cations of statistical principles, measurement 
theory and precise mathematical relations. 
Compared with existing packages, SAD is simpler 
in formulation, more user friendly, much faster and 
less thirsty for memory, offers higher accuracy and 
supplies quantitative statistics for its predictions. 
Unique among such algorithms, SAD's running 
time scales linearly with array size; on a typical 
modern notebook, it completes high-quality CNV 
analyses for a 250 thousand-probe array in ~1 s
and a 1.8 million-probe array in ~8 s.

INTRODUCTION 

Amplification or deletion of chromosomal segments can
lead to abnormal mRNA transcript levels and result in
malfunctioning of cellular processes. Locating such
chromosomal aberrations in comparative genomic DNA
samples, or copy number variation (CNV) (1-4), is an
important step in understanding the pathogenesis of
many diseases, especially cancer. Array comparative
genomic hybridization (CGH) is a high-throughput tech- 
nique developed for measuring such changes (5-7). CGH 
arrays using Bacterial Artificial Chromosome (BAC) 
clones have resolutions of the order of 1 Mb (6). Those
using cDNA and oligonucleotides as probes (1,8) are less
robust than BACs for large segments, but offer much
higher resolutions (of the order of 50-100 kb). In particular,
oligonucleotide arrays allow design flexibility and
greater coverage and provide good sensitivity (8). Tiling
on custom arrays is also available now for even finer
resolution of specific regions and allows the detection of
micro-amplifications and deletions (9,10). The drastic
improvement in resolution has led to a corresponding
increase in the number of probes on an array; modern 
high-resolution arrays now easily exceed one million 
probes. Such arrays place severe demands on the
speed and accuracy of algorithms used to analyze them
and have vastly reduced the usefulness of existing
algorithms that are $O(N^2)$, where $N$ is array size, in
computation time or memory requirement. Here, we propose a
novel algorithm, segmentation analysis of DNA (SAD), for
studying CNV in high-resolution arrays.

For a probe, the log2-ratio of intensities from a pair of 
microarrays is termed a datum. Based on our observation 
that datum errors tend to be normally distributed, we 
designed SAD with three features, respectively involving 
the use of: (i) the Gaussian distribution function 
(Gaussian) as a probability density function (PDF) for 
evaluating the true value of a measured datum; (ii) a clus- 
tering procedure based on a technique we call pair-wise 
Gaussian merging (PGM); (iii) the z-statistic for making
clustering decisions. Details are given in Methods. The
operational principles of PGM are schematically illustrated
in Figure 1. In this case, the original 10 datums are
predicted by SAD to have an underlying structure of two
segments. SAD has one essential parameter, the threshold
z-value $z_0$, and an optional one, the sampling size $N_s$. $z_0$
defines a significance level $p_0$ for making clustering
decisions and for calling CNVs. $N_s$ is used for speeding up
SAD.
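
As an illustration of how a threshold $z_0$ maps to a significance level $p_0$, the conversion can be sketched in a few lines of Python. This is a minimal sketch, not part of the SAD program: the function name `p_from_z` is hypothetical, and the choice of a two-sided P-value by default is an assumption, since the text does not state which tail convention is used.

```python
import math

def p_from_z(z0: float, two_sided: bool = True) -> float:
    """P-value of a standard normal z-value (two-sided by assumption)."""
    # Upper-tail probability of the standard normal distribution.
    tail = 0.5 * math.erfc(abs(z0) / math.sqrt(2.0))
    return 2.0 * tail if two_sided else tail

for z0 in (1.5, 2.0, 4.0, 8.0):
    print(f"z0 = {z0:4.1f}  ->  p0 = {p_from_z(z0):.3g}")
```

Larger thresholds correspond to rapidly shrinking significance levels, which is why $z_0$ controls both how readily clusters are kept separate and how stringently CNVs are called.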

We show in the following sections that, compared with 
algorithms found in the literature, SAD has a simpler but 
more rigorous formulation, is easier to understand and 
simpler to use, provides clearer statistical interpretation 
for its results, requires less memory, offers better 
accuracy and is vastly faster in computation speed. 



MATERIALS AND METHODS 

Normal distribution of error 

Figure 1. Schematic illustration of PGM applied to genome
segmentation. Frames on the left, with the x-axis indicating
relative probe position on the genome, display datums as solid
grey squares and clusters as black crosses with error bars;
frames on the right display the associated Gaussians. (a) Each
datum is treated as a Gaussian with the same variance. (b) Datums
'o' and 'p', the nearest neighbouring pair, are merged in the
first iteration. (c) and (d) Second and third iterations,
respectively. (e) Merging stops after eight iterations, when the
remaining pair of clusters are considered resolvable.

Data not having any CNV are best for demonstrating
normal distribution of error. For this reason we
contrasted pairs of replicate arrays among each of the
four triplicate array sets, NA15510_Nsp, NA15510_Sty,
NA10851_Nsp and NA10851_Sty (henceforth the Redon
data set), that were produced on the Affymetrix 500K EA
platform in a CNV study (3). Because each set has three
contrasted pairs, the sets give a total of 12 error
distributions. In Figure 2, the error distributions, after
standardization and normalization, are compared to standard
normal distributions in terms of the Kolmogorov-Smirnov
statistic (KS). The small KS values confirm
that a Gaussian is an excellent approximation to the error
distributions.
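
The normality check described above can be sketched as follows. This is an illustrative Python sketch on synthetic data, not the analysis performed on the Redon data set: the sample is generated with a hypothetical error level of 0.2, and the function name is an assumption. It standardizes a sample and computes its KS distance to the standard normal distribution.

```python
import math
import random
import statistics

def ks_against_standard_normal(values):
    """Kolmogorov-Smirnov distance between the standardized sample and N(0, 1)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    z = sorted((v - mean) / sd for v in values)
    n = len(z)
    ks = 0.0
    for i, x in enumerate(z):
        cdf = 0.5 * math.erfc(-x / math.sqrt(2.0))   # standard normal CDF
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        ks = max(ks, abs(cdf - i / n), abs((i + 1) / n - cdf))
    return ks

# Synthetic stand-in for one contrasted replicate pair: with no CNV present,
# the log2-ratio of two replicate arrays is pure measurement error.
rng = random.Random(0)
errors = [rng.gauss(0.0, 0.2) for _ in range(5000)]
print(f"KS = {ks_against_standard_normal(errors):.4f}")
```

A small KS value indicates that the empirical distribution stays close to the Gaussian everywhere, which is the criterion used in Figure 2.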

We examined error properties in more detail using the 
Affymetrix 500K copy number sample data set
(http://www.affymetrix.com). Figure 3a shows the log2-ratio
profile of chromosome 2 from the (CRL-5868D, 
CRL-5957D) STY pair and our selection of two 
~8000-datum sections of obviously distinct means. 
Figure 3b compares the log2-ratio distributions of the
two sections with their respective Gaussian approximations,
$G(y; 0.35, (0.22)^2)$ and $G(y; -0.43, (0.23)^2)$, which
have different means but similar variances. These two
sections and an artificial 8000-datum section of
randomly generated $G(y; 0, (0.22)^2)$ noise were used to
study the sample-size and spatial dependence of error.
Each section is partitioned into subsections of width 4', 
i = 3 to 8, plus a discarded remainder. The error of each 
subsection is measured using Equation (4). Each section at 
each i thus has an error distribution whose mean and 
standard deviation are plotted in Figure 3c. The two
sections are shown to have spatial as well as statistical
properties similar to those of the artificial data. In
particular, this implies that, for the array data, statistical
errors (excluding breakpoints) are more or less uniformly
distributed.

Figure 2. Normality test of datum error using the Redon data set.
In terms of KS, the normalized standard error distributions,
shown as grey histograms, are compared to standard normal
distributions, shown as black lines.

Figure 3. Sample-size and spatial independence of variation. Data
are from the Affymetrix 500K copy number sample data set. (a) The
two sections and a remainder of chromosome 2 from the
(CRL-5868D, CRL-5957D) STY pair. (b) log2-ratio distributions of
sections 1 and 2 compared with their Gaussian approximations. (c)
Subsection error distributions computed from subsections of the
two sections and, for comparison, an artificial 8000-datum
section generated with Gaussian noise.



Pair-wise Gaussian merging 

Given a measured value $v$, the conditional probability
for its true value being $y$ is $\Pr(y|v) = \Pr(y \cap v)/\Pr(v)$.
Similarly, given a set of independently measured values
$\Omega = \{v_i \mid i = 1, \ldots, w\}$, we have
$\Pr(y|\Omega) = \Pr(y \cap \Omega)/\Pr(\Omega)$
and, from the independence of events,
$\Pr(y \cap \Omega) = \prod_{i=1}^{w} \Pr(y \cap v_i)$,
$\Pr(\Omega) = \prod_{i=1}^{w} \Pr(v_i)$. Therefore,
$\Pr(y|\Omega) = \prod_{i=1}^{w} \Pr(y|v_i)$. In the case of continuous
variables, the probability that the true value lies in the
interval $y$ to $y + dy$ is $\Pr(y, dy|\Omega) = dy\,D(y|\Omega)$, with
$D(y|\Omega) \propto \prod_{i=1}^{w} D(y|v_i)$, where the $D$'s are PDFs. Given
that errors are normally distributed with initial variance
$\sigma^2$, we approximate $D(y|v_i)$ by a Gaussian
$G(y; v_i, \sigma^2) = (\sigma\sqrt{2\pi})^{-1} \exp(-(y - v_i)^2/2\sigma^2)$. Repeatedly
using the relation that a product of two Gaussians is
another Gaussian we have

$$D(y|\Omega) = G(y; \hat{\mu}, \hat{\sigma}^2); \quad \hat{\mu} = \frac{1}{w}\sum_{i=1}^{w} v_i; \quad \hat{\sigma}^2 = \sigma^2/w. \qquad (2)$$

We call this method of merging Gaussians to obtain a
PDF from a set of measurements Gaussian merging
(GM). The formulations of both $\hat{\mu}$ and $\hat{\sigma}$ are intuitively
understood: $\hat{\mu}$ is the mean of the measured values and $\hat{\sigma}^2$ is
inversely proportional to sample size, as expected.
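
The merging rule can be checked numerically. The Python sketch below is an illustration, not the SAD implementation: the `Gaussian` class and `merge` function are hypothetical names, and the data values are made up. It uses the general result that the normalized product of two Gaussian PDFs is a Gaussian whose precision (inverse variance) is the sum of the two precisions, and verifies that repeated merging of equal-variance Gaussians reproduces Equation (2).

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Gaussian:
    mean: float
    var: float

def merge(g1: Gaussian, g2: Gaussian) -> Gaussian:
    """Normalized product of two Gaussian PDFs (precision-weighted merge)."""
    precision = 1.0 / g1.var + 1.0 / g2.var
    mean = (g1.mean / g1.var + g2.mean / g2.var) / precision
    return Gaussian(mean, 1.0 / precision)

sigma = 0.2                            # hypothetical datum error
values = [0.31, 0.42, 0.28, 0.39]      # hypothetical measurements

merged = Gaussian(values[0], sigma ** 2)
for v in values[1:]:
    merged = merge(merged, Gaussian(v, sigma ** 2))

# Equation (2): the merged mean is the sample mean, the variance is sigma^2 / w.
assert math.isclose(merged.mean, sum(values) / len(values))
assert math.isclose(merged.var, sigma ** 2 / len(values))
print(merged)
```

Because the merged Gaussian depends only on the running mean and the number of datums merged, a cluster can be stored as a (mean, width) pair.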

To allow for the possibility that $\Omega$ comprises multiple
subsets, each the manifestation of a different true value, we
conduct a two-sample z-test (for independent samples
with equal variances) before merging two Gaussians,
using a z-value here called the resolvability,

$$z_r(G_1, G_2) = \frac{\hat{\mu}_1 - \hat{\mu}_2}{\sqrt{\hat{\sigma}_1^2 + \hat{\sigma}_2^2}}, \qquad (3)$$

where $G_k = G(y; \hat{\mu}_k, \hat{\sigma}_k^2)$, $k = 1$ and 2. That $z_r$ follows a
standard normal distribution is shown in Supplementary
Data. The corresponding P-value of $z_r$ tests the null
hypothesis that $G_1$ and $G_2$ have the same true value. Given a
threshold resolvability $z_0$, we say $G_1$ and $G_2$ are resolvable
if $|z_r(G_1, G_2)| > z_0$, in which case the two Gaussians are kept
separate; otherwise they are unresolvable and are merged. The
following four-step procedure, which we call PGM,
partitions $\Omega$ into resolvable subsets: (i) Estimate the variance of
each datum. (ii) Select $z_0$. (iii) Identify the unresolvable
pair of Gaussians with the smallest $|z_r|$ and use GM to
merge the pair. (iv) Iterate step (iii) until all remaining
pairs are resolvable. PGM is a type of agglomerative
hierarchical clustering using $z_r$ as distance. In the present
application, only spatially contiguous datums (except when
separated by an outlier) are merged, and the partitioned
subsets correspond to segments of different log2-ratios.
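
The four-step procedure can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the SAD program: the resolvability is taken in the standard two-sample form $z_r = (\hat{\mu}_1 - \hat{\mu}_2)/\sqrt{\hat{\sigma}_1^2 + \hat{\sigma}_2^2}$ with $\hat{\sigma}_k^2 = \sigma^2/w_k$, only contiguous clusters are merged, outlier handling is omitted, and the function names and data are hypothetical. The scan for the least resolvable pair is done naively in each iteration.

```python
import math

def resolvability(c1, c2, sigma):
    """z_r for clusters given as (mean, width); each has variance sigma^2 / width."""
    (m1, w1), (m2, w2) = c1, c2
    return (m1 - m2) / (sigma * math.sqrt(1.0 / w1 + 1.0 / w2))

def gaussian_merge(c1, c2):
    """GM for clusters sharing one per-datum variance: a width-weighted mean."""
    (m1, w1), (m2, w2) = c1, c2
    return ((m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2)

def pgm(values, sigma, z0):
    """Merge contiguous clusters until every neighbouring pair is resolvable."""
    clusters = [(v, 1) for v in values]           # one Gaussian per datum
    while len(clusters) > 1:
        # Step (iii): find the least resolvable contiguous pair.
        i = min(range(len(clusters) - 1),
                key=lambda k: abs(resolvability(clusters[k], clusters[k + 1], sigma)))
        if abs(resolvability(clusters[i], clusters[i + 1], sigma)) > z0:
            break                                 # step (iv): all pairs resolvable
        clusters[i:i + 2] = [gaussian_merge(clusters[i], clusters[i + 1])]
    return clusters

# Hypothetical 10-datum profile with one breakpoint in the middle.
data = [0.02, -0.05, 0.04, 0.01, -0.03, 0.61, 0.55, 0.58, 0.63, 0.57]
for mean, width in pgm(data, sigma=0.05, z0=3.0):
    print(f"width = {width:2d}  mean = {mean:+.3f}")
```

In this example the ten datums collapse into two clusters of five datums each, in the same spirit as the two-segment outcome illustrated in Figure 1.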

The SAD algorithm: clustering 

SAD has two clustering modes: the linear mode (LM) for
low-resolution arrays or when computation time is not a
concern, and the parallel mode (PM) otherwise. LM has a
single parameter $z_0$ while PM has an additional parameter
$N_s$ whose default value of 100 is highly recommended. The
steps in LM are: (i) Computation of $\sigma$. Let $\{v_i \mid i = 1, \ldots, N\}$ be
the initial data of log2-ratio, $q_i = v_{i+1} - v_i$ and $SD_q$ be the
standard deviation of the $q_i$'s, then

$$\sigma = SD_q/\sqrt{2}. \qquad (4)$$

Being computed from differences between neighbouring datums,
$\sigma$ measures datum error and is sensitive only to the
existence of breakpoints, which are assumed to be sparse.
Treat each datum as a single-datum cluster and assign
$G(y; v_i, \sigma^2)$ to the $i$-th datum-cluster. (ii) Selection of $z_0$.
This stipulates when PGM iteration stops and addresses
the statistical issues discussed in the following subsection.
(iii) PGM Phase I. Perform chromosome-wide PGM
iteratively on all contiguous cluster pairs. At the end of this
phase each remaining single-datum cluster is a 'loner'
whose existence prevents the merging of its two
neighbouring clusters even if they are unresolvable. (iv) PGM
Phase II. Along with contiguous pairs, continue step (iii)
to merge loner-divided pairs. After a loner-divided pair is
merged the dividing loner becomes an 'outlier' and is
excluded from subsequent calculation. At the end of this
stage each of the resultant clusters is a 'segment' with an
associated Gaussian $G(y; \hat{\mu}, \hat{\sigma}^2)$ serving as a PDF for its
true value. (v) Normalization. Perform genome-wide
PGM on the entire set of segments to merge contiguous
as well as unconnected segment pairs. Identify the largest
resultant cluster and denote its mean by $\bar{\mu}$, here called the
'baseline'. The baseline will be taken as the reference for
the CNV significance test.
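
Step (i) can be illustrated with a short Python sketch. The data are synthetic and the function name `estimate_sigma` is hypothetical; the point is only to show why the estimate of Equation (4), built from successive differences, is barely affected by a breakpoint, whereas the plain standard deviation of the profile is inflated by the shift in mean.

```python
import math
import random
import statistics

def estimate_sigma(values):
    """Equation (4): SD of successive differences divided by sqrt(2)."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return statistics.stdev(diffs) / math.sqrt(2.0)

rng = random.Random(1)
true_sigma = 0.22
# Synthetic profile: 4000 baseline datums followed by 4000 datums shifted by 0.5.
profile = [rng.gauss(0.0, true_sigma) for _ in range(4000)]
profile += [rng.gauss(0.5, true_sigma) for _ in range(4000)]

print(f"Equation (4) estimate : {estimate_sigma(profile):.3f}")
print(f"naive overall SD      : {statistics.stdev(profile):.3f}")
```

Only the single difference that straddles the breakpoint is affected by the shift, which is why sparse breakpoints have little influence on the estimate.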

As PGM involves very little computation, LM is
inherently a fast algorithm. On the other hand, owing to the
iterative procedure, the problem size is $O(N^2)$, implying
long computation time when $N$ is large. PM reduces the
problem size to $O(N)$ with little sacrifice in accuracy. In
that case, a sampling size $N_s$ is selected (by the user) and
the various steps in LM are adjusted as follows. In (v), $\bar{\mu}$ is
computed using only the widest $N_s$ segments. This reduces
problem size from $O(N_{seg}^2)$, where $N_{seg}$ is the number of
resultant segments, to $O(N_s^2)$. In (iii) and (iv), prior to
merging, the entire current cluster set is partitioned into
subsets of $N_s$ contiguous clusters, plus a remainder. The
subsets are processed in parallel and the most unresolvable
pair in each subset, if there is any, is merged. Thereafter
the subsets of clusters (some of which have been reduced
in size through merging) are joined with the remainder
circularly, with the beginning of the remainder taken as
the starting point, and readied for a new round of
partition and merging. This is a dynamical procedure resulting
in a different partition in each iteration. The problem size
for each of the $N/N_s$ subsets is $O(N_s^2)$, making the total
problem size $O(N N_s)$.
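
The partition-and-merge rounds can be sketched in Python as follows. This is a simplified illustration, not the SAD implementation, and it departs from the description above in ways that should be read as assumptions: instead of joining the subsets circularly with the remainder, the sketch shifts the partition boundaries by half a subset on alternate rounds so that every contiguous pair eventually falls inside some subset; the subsets are processed one after another, although they are independent and could be run in parallel; and loner and outlier handling is omitted. The resolvability is taken as $z_r = (\hat{\mu}_1 - \hat{\mu}_2)/(\sigma\sqrt{1/w_1 + 1/w_2})$, and all names are hypothetical.

```python
import math

def z_r(c1, c2, sigma):
    """Resolvability of two clusters given as (mean, width)."""
    return (c1[0] - c2[0]) / (sigma * math.sqrt(1.0 / c1[1] + 1.0 / c2[1]))

def merge_most_unresolvable(chunk, sigma, z0):
    """Merge the least resolvable contiguous pair of one subset, if there is one."""
    if len(chunk) < 2:
        return chunk, False
    i = min(range(len(chunk) - 1),
            key=lambda k: abs(z_r(chunk[k], chunk[k + 1], sigma)))
    if abs(z_r(chunk[i], chunk[i + 1], sigma)) > z0:
        return chunk, False
    (m1, w1), (m2, w2) = chunk[i], chunk[i + 1]
    merged = ((m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2)
    return chunk[:i] + [merged] + chunk[i + 2:], True

def pgm_partitioned(values, sigma, z0, n_s=100):
    """PGM in rounds over subsets of n_s contiguous clusters (requires n_s >= 2)."""
    clusters = [(v, 1) for v in values]
    offset, idle_rounds = 0, 0
    while idle_rounds < 2:                        # two idle rounds cover both offsets
        bounds = [0] + list(range(offset, len(clusters), n_s)) + [len(clusters)]
        merged_any, joined = False, []
        for lo, hi in zip(bounds, bounds[1:]):    # each subset is independent work
            chunk, merged = merge_most_unresolvable(clusters[lo:hi], sigma, z0)
            joined.extend(chunk)
            merged_any = merged_any or merged
        clusters = joined
        idle_rounds = 0 if merged_any else idle_rounds + 1
        offset = n_s // 2 if offset == 0 else 0   # shift the partition every round
    return clusters

# Hypothetical 10-datum profile with one breakpoint, processed in subsets of 4.
data = [0.02, -0.05, 0.04, 0.01, -0.03, 0.61, 0.55, 0.58, 0.63, 0.57]
print(pgm_partitioned(data, sigma=0.05, z0=3.0, n_s=4))
```

Each round does a bounded amount of work per subset, which is the source of the $O(N N_s)$ total cost; the merging order differs from that of the linear mode, which is why a very small subset size can affect accuracy.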

The SAD algorithm: CNV calling and selection of $z_0$

After clustering, consider two contiguous segments: a
narrow segment $s_1$ of $G_1 = G(y; \hat{\mu}_1, \sigma^2/w_1)$ and a much
wider non-CNV segment $s_2$ of $G_2 = G(y; \bar{\mu}, \sigma^2/w_2)$. Let
$H_a$ be the null hypothesis that $s_1$ is non-CNV (i.e. the
true value of $s_1$ is $\bar{\mu}$). An independent one-sample z-test
using a z-value, here called the 'aberrance',

$$z_a(G_1) = (\hat{\mu}_1 - \bar{\mu})\sqrt{w_1}/\sigma, \qquad (5)$$

yields a P-value for testing $H_a$, as is expected by the
central limit theorem. From Equations (3) and (5),
because $w_2 \gg w_1$, we have

$$z_r(G_1, G_2) \approx (\hat{\mu}_1 - \bar{\mu})\sqrt{w_1}/\sigma = z_a(G_1). \qquad (6)$$

The lower bound for $|z_r(G_1, G_2)|$, $z_0$, is therefore also the
approximate lower bound for $|z_a(G_1)|$. We therefore
employ $p_0$, the corresponding P-value of $z_0$, as the
significance level for testing $H_a$. We call $s_1$ a CNV if $|z_a(G_1)| > z_0$.
More specifically, we call the segment a 'gain' if
$z_a(G_1) > z_0$, or a 'loss' if $z_a(G_1) < -z_0$.

Because $|\hat{\mu}_1 - \bar{\mu}|/\sigma$ is just the signal-to-noise ratio
(SNR) of $s_1$, Equation (6) leads to

$$\sqrt{w_1} \gtrsim z_0/\mathrm{SNR}. \qquad (7)$$

That is, if SNR is known, $z_0$ also sets an approximate
lower bound for CNV width.
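
The calling rule can be written out as a short Python sketch. The function names and the example segments are hypothetical; the aberrance follows Equation (5), and the gain/loss labels follow the thresholds stated above.

```python
import math

def aberrance(segment_mean, width, baseline, sigma):
    """Equation (5): z-value of a segment relative to the baseline."""
    return (segment_mean - baseline) * math.sqrt(width) / sigma

def call_segment(segment_mean, width, baseline, sigma, z0):
    """Label a segment 'gain', 'loss' or 'none' by comparing z_a with z0."""
    z_a = aberrance(segment_mean, width, baseline, sigma)
    if z_a > z0:
        return "gain", z_a
    if z_a < -z0:
        return "loss", z_a
    return "none", z_a

# Hypothetical segments: (mean log2-ratio, width in datums).
segments = [(0.58, 4), (-0.60, 3), (0.10, 6)]
for mean, width in segments:
    label, z_a = call_segment(mean, width, baseline=0.0, sigma=0.09, z0=8.0)
    print(f"mean = {mean:+.2f}  width = {width}  z_a = {z_a:+6.2f}  ->  {label}")
```

The same shift in log2-ratio yields a larger $|z_a|$ for a wider segment, which is how $z_0$ ends up limiting the width of callable CNVs.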



Software availability 

The SAD program is available for download at:
http://www.sybbi.ncu.edu.tw/software.htm or upon request by
email at: pairwise.gaussian.merging@gmail.com.

RESULTS 

In Lai et al. (11) (hereafter referred to as LJKP) the
performances of 11 CNV algorithms were compared using
simulated data for testing receiver operating characteristic
(ROC) as well as real Glioblastoma Multiforme (GBM)
data. These comprised 3 smoothing-only (SO) algorithms,
lowess, wavelet (12) and quantreg (13), and 8
estimation-performing (EP) algorithms, CGHseg (14),
CBS (15), ChARM (16), ACE (17), HMM (18), GLAD
(19), GA (20) and CLAC (21). LJKP found that the overall
top three EP performers were CGHseg, CBS and GLAD.
In Fiegler et al. (22)
two more recently developed EP algorithms, CNVfinder
(22) and SW-ARRAY (23), were compared in accuracy
using real data. Among these algorithms only CLAC
and ACE provide quantitative statistics.

We tested SAD against the 10 EP algorithms in ROC. The
SO algorithms were excluded because they do not
explicitly address breakpoints. The ones rated accurate,
CGHseg, CBS and GLAD, were further compared to
SAD in speed and memory. In addition we validated
SAD on low- and high-resolution data sets. We designate
a SAD run in LM by SAD($z_0$, -) and in PM by
SAD($z_0$, $N_s$).

Accuracy 

We calculated (details in Supplementary Data) the ROC
curves of SAD the same way as in LJKP except that, for
better statistics, we generated 10 000 instead of 100
simulated chromosomes (of 100 datums each) for each
parameter set in each setting. The results (Supplementary
Figure S1) indicate that a higher $z_0$ is more suitable for
easy settings (wide CNV and large SNR) while a
lower $z_0$ better facilitates CNV detection in difficult
settings (narrow CNV or small SNR). Table 1 compares
SAD($z_0$,100), $z_0$ = 1.5, 2.0 and 4.0, in area-under-curve
(AUC) value with the 10 EP algorithms for two easy
settings, (SNR, width) = (4,20) and (3,40), and two difficult
settings, (2,5) and (1,10). Numbers for the eight
LJKP-tested algorithms were read from Figure 2 in
LJKP. Numbers for SW-ARRAY and CNVfinder were
calculated using their reportedly optimal parameter
values. In the easy settings, SAD(1.5-4.0,100), CBS,
CGHseg, GA, GLAD and HMM perform well. In the
difficult settings, SAD(1.5-2.0,100) is the best performer
and CGHseg is next. Although CNVfinder performs
above average in the difficult settings, it is below average
in the easy settings.

Table 1. Comparison in AUC value of ROC of SAD against
existing algorithms for two easy settings, (SNR, width) = (4,20) and
(3,40), and two difficult settings, (2,5) and (1,10)

Algorithm       (4,20)   (3,40)   (2,5)    (1,10)
SAD(1.5,100)    0.99     0.99     0.93     0.83
SAD(2.0,100)    0.99     0.99     0.92     0.84
SAD(4.0,100)    0.99     0.99     0.71     0.59
ACE             0.99     0.92     0.73     0.57
CBS             0.99     0.99     0.75     0.59
CGHseg          0.99     0.99     0.94     0.78
ChARM           0.93     0.91     0.50     0.50
CLAC            0.97     0.95     0.84     0.68
GA              0.99     0.99     0.55     0.51
GLAD            0.99     0.99     0.56     0.51
HMM             0.99     0.99     0.65     0.54
SW-ARRAY        0.86     0.82     0.53     0.52
CNVfinder       0.97     0.95     0.90     0.75
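
For readers unfamiliar with the AUC figure of merit, the following Python sketch shows how an ROC curve and its area can be computed from per-probe scores and ground-truth labels. It is a generic illustration with made-up numbers and hypothetical function names; it does not reproduce the exact protocol of LJKP or the simulations reported here.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping a threshold over the scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and not l)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical per-probe scores (e.g. magnitude of the fitted log2-ratio)
# and ground truth (1 = probe lies inside a planted CNV).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
print(f"AUC = {auc(roc_points(scores, labels)):.3f}")
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, which gives a scale for reading the entries of Table 1.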

In PM, higher computation speed is facilitated by using
a smaller $N_s$. Because PM alters the clustering order
relative to that in LM, this can induce error when $N_s$ is
too small. We tested SAD in this regard and found that
overall error is negligible when $N_s \geq 100$ (Supplementary
Figure S2).

Speed and memory 

All calculations reported here were carried out on a
computer with an Intel Core 2 Duo T7500 2.2 GHz (L2: 4 MB)
CPU and 2 GB of DDR2 memory, running Windows XP
as operating system. All programs ran as a single thread
and used 50% of the CPU. Our SAD program is written in
Visual C++. The other algorithms were tested with the
provided programs at default parameter values. The
simulated chromosomes were generated with SNR = 2.
Each simulated chromosome had either one or two
gains. For planting the gains each chromosome was
divided into five same-width sections. The second section
was amplified in one-gain cases, and the second and the
fourth sections were amplified in two-gain cases.
Computation time $\tau$ was measured for each case; the
difference in $\tau$ between one and two gains reflects the
dependence of speed on genomic profiles. Memory usage was read
from the Processes tab of Windows Task Manager and
involves two steps: data loading and data processing.
The reading between the two steps, denoted by $\kappa_d$, is the
memory used for program and data. The maximum
reading during data processing was recorded as $\kappa_o$ and
the difference $\kappa_p = \kappa_o - \kappa_d$ was taken to be the maximum
memory needed for data processing. The power-law
exponents $\gamma_\tau$ and $\gamma_\kappa$ were derived from the $N$ dependences of $\tau$
and $\kappa_p$, respectively.
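
A power-law exponent of this kind can be obtained as the slope of a straight-line fit in log-log coordinates. The Python sketch below illustrates the idea with made-up timings; the function name is hypothetical and the numbers are not measurements from this study, and the fitting procedure actually used for Figure 4 is not specified in the text.

```python
import math

def power_law_exponent(sizes, values):
    """Least-squares slope of log(value) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in values]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical timings, not measurements from this study.
sizes = [1_000, 10_000, 100_000, 1_000_000]
linear_times = [0.004, 0.04, 0.4, 4.0]           # tau proportional to N
quadratic_times = [0.01, 1.0, 100.0, 10_000.0]   # tau proportional to N^2

print(f"gamma (linear)    = {power_law_exponent(sizes, linear_times):.2f}")
print(f"gamma (quadratic) = {power_law_exponent(sizes, quadratic_times):.2f}")
```

An exponent near 1 indicates linear scaling and an exponent near 2 indicates quadratic scaling.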

We compared SAD(10,100) to CGHseg, CBS and
GLAD and show the results in Figure 4. We see that: (i)
SAD is vastly faster than the others; at $N \approx 10^6$ it is
already two orders of magnitude faster than CBS, its
closest competitor. (ii) In computation time SAD is
$O(N)$ while GLAD and CGHseg are $O(N^2)$. CBS,
claimed to be $O(N)$ at low resolution (24), becomes
0(N 2 ) at jV x 5 x 10 . (iii) Speed dependence on genomic 
profile, reflected by the difference between the 1-gain
results and the 2-gain results, is significant for CBS,
minor for GLAD and CGHseg, and negligible for SAD.
(iv) SAD requires the least amount of memory, overall ($\kappa_o$)
as well as for data processing ($\kappa_p$). (v) In memory
requirement SAD and GLAD scale as $O(N)$, CBS displays
irregularity, and CGHseg scales as $O(N^2)$. On a computer with
2 GB of memory, CGHseg ceases to function when $N$
exceeds about 16 000. For this reason CGHseg is not
considered for further comparison.

Using real data, we ran SAD(10,100) on a
1.8-million-probeset Affymetrix Genome-Wide Human SNP
Array 6.0 hybridized with a colorectal cancer sample, and
measured $\tau$ = 8 s and $\kappa_o$ = 323 MB.

Validation on a low-resolution data set 

We used a 2276-BAC public data set from the NIGMS 
Human Genetics Cell Repository (25) (henceforth the 
Snijders dataset) to perform low-resolution validation of 
SAD and to demonstrate the utility of z 0 for limiting CNV 
width. The dataset corresponds to 15 human cell strains. 
As identified by spectral karyotyping, each cell strain has 
either one or two CNVs and eight of the CNVs on six 
strains were detected to be whole-chromosome. We set a 
value of $z_0$ using Equation (7). For trisomic segments, the
data set has $\mathrm{SNR} \approx 0.58/0.09$, where $0.58 \approx \log_2(3/2)$ is
approximately the log2-ratio of a trisomic segment and
0.09 is the value for $\sigma$ obtained from Equation (4). To
detect a minimum CNV width between one datum
(because one-datum CNVs are likely to be outliers) and
two, $6.4 < z_0 < 9.1$ is required. We therefore used
SAD(8,100) for this calculation.
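
The arithmetic behind this range can be reproduced directly from the width bound $\sqrt{w} \gtrsim z_0/\mathrm{SNR}$. The short Python sketch below uses only the numbers quoted in the text; the function names are hypothetical.

```python
import math

snr = 0.58 / 0.09     # log2(3/2) ~ 0.58 over sigma = 0.09, as quoted in the text

def z0_for_width(width, snr):
    """Threshold at which a CNV of the given width sits exactly on the bound."""
    return snr * math.sqrt(width)

def min_width(z0, snr):
    """Smallest width satisfying sqrt(w) >= z0 / SNR."""
    return (z0 / snr) ** 2

print(f"log2(3/2)         = {math.log2(1.5):.3f}")
print(f"z0 at width 1     = {z0_for_width(1, snr):.1f}")
print(f"z0 at width 2     = {z0_for_width(2, snr):.1f}")
print(f"min width at z0=8 = {min_width(8.0, snr):.2f} datums")
```

With $z_0 = 8$ the minimum width falls between one and two datums, so single-datum excursions are rejected while two-datum trisomic segments remain detectable.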

Because the data set had previously been examined by
GLAD (19) and CBS (15), we compared the three sets of
results in full detail in Supplementary Table S1, and
summarize the comparison as follows. (1) SAD(8,100) detects
more CNVs than GLAD and CBS do. (2) SAD(8,100)
gives far fewer false positives; the average numbers of
false-positive breakpoints per cell strain are 2/15, 46/15, 26/15,
37/9 and 16/9 for SAD(8,100), GLAD($\lambda'$ = 8),
GLAD($\lambda'$ = 10), CBS($\alpha$ = 0.01) and CBS($\alpha$ = 0.001),
respectively. (3) SAD alone assigns a z-value to each CNV
for assessing significance. (4) SAD(8,100) alone detects
whole-chromosome CNVs, on whose detection GLAD
and CBS are silent because they are based on breakpoint
detection within chromosomes.

Figure 4. Comparisons of SAD to CGHseg, CBS and GLAD in speed
and memory requirement. (a) Computation time $\tau$ versus $N$. (b)
Power-law exponent $\gamma_\tau$ for $\tau$ derived from (a). (c) Overall
memory $\kappa_o$ versus $N$. (d) Data-processing memory $\kappa_p$ versus $N$.
(e) Power-law exponents $\gamma_\kappa$ for $\kappa_p$ derived from (d). Legend:
SAD(10,100), GLAD, CBS and CGHseg, each for 1-gain and 2-gain
cases.



Validation on a high-resolution dataset 

In Redon et al. (3), 43 genomic regions were examined by 
SYBR real-time PCR or MassSpec to validate the respect- 
ive CNV calls for NA15510 vs NA10851 on the 
Affymetrix 500K EA platform. We used three of these 
regions, cnp8, cnp23 and cnp36, respectively determined 
in (3) to be gain, loss and gain, to validate SAD and to 
demonstrate the utility of $z_0$ for characterizing CNV
significance. In Figure 5, the results of three runs,
SAD(10,100), SAD(8,100) and SAD(6,100), on the first
Sty replicates of the Redon data set are respectively
shown in frame sets (a), (b) and (c). At $z_0$ = 10
(Figure 5a) only cnp36 is detected, with $z_a$ = 10.6. When
$z_0$ is lowered to 8 (Figure 5b), cnp23 is detected with
$z_a$ = -8.2. When $z_0$ is further lowered to 6 (Figure 5c),
cnp8 is detected with $z_a$ = 7.4.
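
The effect of lowering the threshold can be restated as a small Python sketch that applies the calling rule to the three aberrance values quoted above. Only the three $z_a$ values come from the text; the dictionary and function names are hypothetical, and the sketch assumes that a region's aberrance does not change between runs.

```python
# Aberrance values reported in the text for the three validated regions.
aberrance = {"cnp36": 10.6, "cnp23": -8.2, "cnp8": 7.4}

def detected(z0):
    """Regions whose |z_a| exceeds the threshold, labelled gain or loss."""
    return {name: ("gain" if z > 0 else "loss")
            for name, z in aberrance.items() if abs(z) > z0}

for z0 in (10, 8, 6):
    print(f"z0 = {z0:2d}: {detected(z0)}")
```

The regions appear in order of decreasing $|z_a|$ as the threshold is relaxed, and the sign of $z_a$ gives the gain or loss call in agreement with the experimental determinations.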
Figure 5. A high-resolution validation test for SAD on 3 genomic
regions with known CNVs, whose positions are shown as thick black
segments in the frames. The three sets of frames are for the three runs:
(a) SAD(10,100); (b) SAD(8,100); and (c) SAD(6,100). Data and SAD
predictions are respectively shown as solid grey squares and black lines.
Shown above each CNV detected by SAD is its aberrance $z_a$.



DISCUSSION 

We have demonstrated that by virtue of its accuracy,
parsimony in memory use and speed, SAD can manage the
challenges of analyzing modern high-resolution microarrays
significantly better than existing algorithms.
Algorithmically, SAD is easy to understand because it employs
fundamental principles of statistics and precise but very
simple mathematics [as compared to the mathematics in
the formulation of, say, GLAD (19)]. SAD makes all
internal decisions based on statistics and provides an
external quantitative statistic. With only two user-tunable
parameters, $z_0$ and $N_s$, the meanings of which are both
intuitively accessible, SAD is also the easiest to use.
Users can select $z_0$, the primary parameter, based on
their requirement for CNV significance or CNV width.
We recommend setting the second parameter, $N_s$, to 100.
This guarantees good accuracy and a computation time
that is $O(N)$.

Quantitative statistics provide the basis on which a level 
of confidence may be assigned to each inference and for 
setting a priority for experimental confirmation for such 
inferences. All measurements, especially those involving 
microarrays, carry inherent statistical error. SAD 
quantifies such errors as data uncertainty, tracks the 
latter throughout a clustering process using exact 
mathematical relations, and provides z-values for 
assessing CNV significance. The z-values, when used for
downstream calculations such as the identification of
recurrent aberrations using multiple arrays, allow the initial
uncertainty to be passed on further.

SAD is an application built on PGM. The improvement of
SAD computation time from $O(N^2)$ to $O(N)$ is a
consequence of the parallel processing made possible by the
employment of agglomerative hierarchical clustering in 
PGM. The superior accuracy of SAD results from the 
exploitation by PGM of a common trait seen in most 
systems: that measurement errors are normally 
distributed. The operating principle of SAD is accessible 
to the user because in PGM the resolving power used for 
determining breakpoints is controlled via an intuitive stat- 
istic threshold. These properties of PGM promise its use- 
fulness and wide application, beyond CNV, in the general 
analysis of microarray data. 